These are the libraries used to complete the project:
library(ggplot2)
library(corrplot)
library(Stack)
library(fastDummies)
library(dplyr)
library(leaflet)
library(caret)
library(randomForest)
The data is hosted on a GitHub gist, and we store it in two data.frame objects called train and test.
train <- read.csv("https://gist.githubusercontent.com/alombatti/f2cb8f1a244784999353309a97a6777f/raw/f38ee90641da97f62964ac10ea558564bfc6cb65/house_price_train.csv")
test <- read.csv("https://gist.githubusercontent.com/alombatti/1df87e336f159842e64ced9380b13e25/raw/49ea0fe6b42fb5d4868e1291dc8e47b20d2cc3bd/house_price_test.csv")
This dataset contains house sale prices for King County, which includes Seattle, Washington. It covers homes sold between May 2014 and May 2015. The dataset provided to us has already been divided into a train set and a test set, with 17277 observations in the train set and 4320 in the test set.
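As a quick sanity check, the reported split sizes can be verified directly; a minimal sketch, assuming the two CSVs above loaded successfully:

```r
# Verify the sizes of the provided train/test split
nrow(train)  # 17277 observations, per the description above
nrow(test)   # 4320 observations
```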
Here is a glimpse of the STRUCTURE of the data (for the train set), which also lists the fields describing it:
str(train)
## 'data.frame': 17277 obs. of 21 variables:
## $ id : num 9.18e+09 4.64e+08 2.22e+09 6.16e+09 6.39e+09 ...
## $ date : Factor w/ 372 levels "1/10/2015","1/12/2015",..: 211 331 290 20 226 223 94 34 76 294 ...
## $ price : num 225000 641250 810000 330000 530000 ...
## $ bedrooms : int 3 3 4 4 4 4 4 3 4 3 ...
## $ bathrooms : num 1.5 2.5 3.5 1.5 1.75 3.5 3.25 2.25 2.5 1.5 ...
## $ sqft_living : int 1250 2220 3980 1890 1814 3120 4160 1440 2250 2540 ...
## $ sqft_lot : int 7500 2550 209523 7540 5000 5086 47480 10500 6840 9520 ...
## $ floors : num 1 3 2 1 1 2 2 1 2 1 ...
## $ waterfront : int 0 0 0 0 0 0 0 0 0 0 ...
## $ view : int 0 2 2 0 0 0 0 0 0 0 ...
## $ condition : int 3 3 3 4 4 3 3 3 3 3 ...
## $ grade : int 7 10 9 7 7 9 10 8 9 8 ...
## $ sqft_above : int 1250 2220 3980 1890 944 2480 4160 1130 2250 1500 ...
## $ sqft_basement: int 0 0 0 0 870 640 0 310 0 1040 ...
## $ yr_built : int 1967 1990 2006 1967 1951 2008 1995 1983 1987 1959 ...
## $ yr_renovated : int 0 0 0 0 0 0 0 0 0 0 ...
## $ zipcode : int 98030 98117 98024 98155 98115 98115 98072 98023 98058 98115 ...
## $ lat : num 47.4 47.7 47.6 47.8 47.7 ...
## $ long : num -122 -122 -122 -122 -122 ...
## $ sqft_living15: int 1260 2200 2220 1890 1290 1880 3400 1510 2480 1870 ...
## $ sqft_lot15 : int 7563 5610 65775 8515 5000 5092 40428 8125 7386 6800 ...
We then check for NAs in the data we are going to use, both in the train and in the test sets.
sum(is.na(train))
## [1] 0
sum(is.na(test))
## [1] 0
As we can see, the data does not contain any missing values, so we can proceed with our analysis.
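An equivalent per-column check makes it easier to spot which field is affected if NAs ever do appear; a minimal sketch:

```r
# Count missing values per column; every entry should be 0 here
colSums(is.na(train))
colSums(is.na(test))
```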
In order to analyze the field DATE, and to determine whether it is relevant to our analysis, we first have to transform it into a usable form. I separate it into year, month, and day fields so that, later in the analysis, we can determine whether any of them is directly correlated with price.
First, I change the field date from factor to Date, and I store in a new data.frame the complete date together with its year, month, and day. Then, I merge the newly created data.frame with the one from which I extracted the date field, and I reorder the columns in a way that makes sense for the analysis.
I repeat the same procedure for both the train and the test datasets.
train$date <- as.Date(train$date, "%m/%d/%Y")
df.date_train <- data.frame(date = train$date,
year = as.numeric(format(train$date, format = "%Y")),
month = as.numeric(format(train$date, format = "%m")),
day = as.numeric(format(train$date, format = "%d")))
train <- unique(merge(train, df.date_train))
rm(df.date_train)
train <- train[, c(2, 1, 22, 23, 24, 3:21)]
test$date <- as.Date(test$date, "%m/%d/%Y")
df.date_test <- data.frame(date = test$date,
year = as.numeric(format(test$date, format = "%Y")),
month = as.numeric(format(test$date, format = "%m")),
day = as.numeric(format(test$date, format = "%d")))
test <- unique(merge(test, df.date_test))
rm(df.date_test)
test <- test[, c(2, 1, 21, 22, 23, 3:20)]
Now I stack together the train and test datasets to proceed with a more accurate analysis. I also drop the fields id and date because they are not going to be relevant.
complete <- Stack(train, test)
complete$id = NULL
complete$date = NULL
The dataset now looks like this:
head(complete)
## year month day bedrooms bathrooms sqft_living sqft_lot floors waterfront
## 1 2014 5 2 3 2.50 1270 1180 3.0 0
## 2 2014 5 2 3 2.25 1970 35100 2.0 0
## 3 2014 5 2 3 2.50 2090 10834 1.0 0
## 4 2014 5 2 3 2.50 1770 1235 3.0 0
## 5 2014 5 2 3 3.00 1850 19966 1.0 0
## 6 2014 5 2 3 2.00 2710 4500 1.5 0
## view condition grade sqft_above sqft_basement yr_built yr_renovated
## 1 0 3 8 1270 0 2001 0
## 2 0 4 9 1970 0 1977 0
## 3 0 4 8 1360 730 1987 0
## 4 0 3 8 1600 170 2007 0
## 5 0 4 7 1090 760 1992 0
## 6 0 4 8 1880 830 1929 0
## zipcode lat long sqft_living15 sqft_lot15 price
## 1 98107 47.6697 -122.392 1320 1180 445700
## 2 98027 47.4635 -121.991 2340 35100 437500
## 3 98003 47.3537 -122.303 1750 8595 285000
## 4 98103 47.6965 -122.342 1680 1203 436110
## 5 98038 47.3493 -122.034 1410 6715 287200
## 6 98115 47.6747 -122.295 2060 4500 805000
Now I plot a correlation matrix to see how strongly the different variables are correlated with our target variable PRICE.
corr = cor(train[, 3:24])
corrplot(corr, method = "color",
outline = T,
cl.pos = "n",
rect.col = "black",
tl.col = "indianred4",
addCoef.col = "black",
number.digits = 2,
number.cex = 0.60,
tl.cex = 0.7,
cl.cex = 1,
col = colorRampPalette(c("red", "white", "green4")) (100))
From the matrix, we can see that many variables are correlated with price to some degree, some more and some less. I decided to proceed with my analysis keeping only the variables with a correlation >= 0.15 with the target variable.
Therefore, going forward I will only keep the variables selected in the code below.
Nonetheless, some considerations have to be made. The variables condition, yr_renovated, and zipcode will be converted to factors further on in the analysis. We still include them in the list of variables to keep because their numeric encoding is not really interpretable as-is, and as numeric values they might not show their true relationship with price.
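The list of kept variables can also be derived programmatically from the correlation matrix; a sketch of the 0.15 threshold, assuming the same numeric columns used for the corrplot above:

```r
# Correlation of every numeric variable with the target
corr_price <- cor(train[, 3:24])[, "price"]

# Keep everything with an absolute correlation of at least 0.15,
# excluding the target itself
keep <- setdiff(names(corr_price)[abs(corr_price) >= 0.15], "price")
print(keep)
```

Note that condition, yr_renovated, and zipcode are added back manually regardless of this threshold, for the reasons given above.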
complete2 <- select(complete, "price", "bedrooms", "bathrooms", "sqft_living", "floors",
"waterfront", "view", "grade", "sqft_above", "sqft_basement", "lat",
"sqft_living15", "condition", "yr_renovated", "zipcode")
This is an interactive map of all the houses based on their geographical position, coloured by price.
complete$PriceBin <- cut(complete$price, c(0, 250e3, 500e3, 750e3, 1e6, 2e6, 999e6))
center_lon = median(complete$long, na.rm = TRUE)
center_lat = median(complete$lat, na.rm = TRUE)
factpal <- colorFactor(c("black", "blue", "yellow", "orange", "#0B5345", "red"),
complete$PriceBin)
leaflet(complete) %>% addProviderTiles("Esri.NatGeoWorldMap") %>%
addCircles(lng = ~ long, lat = ~ lat,
color = ~ factpal(PriceBin)) %>%
setView(lng = center_lon, lat = center_lat, zoom = 12) %>%
addLegend("bottomright", pal = factpal, values = ~ PriceBin,
title = "House Price Distribution",
opacity = 1)
complete$PriceBin <- NULL
In this section we are going to analyze the relationship of each variable with our target, the variable price. Before doing this, I create a copy of the dataset and split it back into train and test, based on whether the variable price is populated.
complete2 <- complete
train2 <- split(complete2, complete2$price > 0)
train2 <- train2[["TRUE"]]
test2 <- split(complete2, is.na(complete2$price))
test2 <- test2[["TRUE"]]
Here we look at the histogram showing how many houses were sold at each price.
graph1 <- ggplot(data = train2, aes(x = price)) +
geom_histogram(alpha = 0.8, fill = "#F1C40F") +
labs(x = "price", y = "houses", title = "Distribution of Prices") +
theme_bw()
graph1
Since the price range of the houses is very wide, it makes sense to focus on a smaller segment, to see the pattern where most of the houses are. In this histogram, we only see the distribution of houses sold at a price of up to 1.5 million US dollars.
graph2 <- ggplot(data = train2, aes(x = price)) +
geom_histogram(alpha = 0.8, fill = "#F1C40F") +
scale_x_continuous(limits=c(0, 1.5e6)) +
labs(x = "price", y = "houses", title = "Distribution of Prices (up to 1.5 million)") +
theme_bw()
graph2
For the sole purpose of data visualization, I now add a column to my train data.frame representing the logarithmic transformation of the price, so that the graphs will look cleaner and more understandable.
train2$logprice = log(train2$price)
graph3 <- boxplot(train2[, "logprice"] ~ train2[, "bedrooms"],
main = "Price vs Bedrooms", col=c("#F1C40F","#336633"),
xlab="bedrooms", ylab="log (price)")
From this graph, we notice two odd values: one house with 11 bedrooms and one with 33.
print(subset(train2, train2$bedrooms > 10))
## year month day bedrooms bathrooms sqft_living sqft_lot floors
## 2866 2014 6 25 33 1.75 1620 6000 1
## 6003 2014 8 21 11 3.00 3000 4960 2
## waterfront view condition grade sqft_above sqft_basement yr_built
## 2866 0 0 5 7 1040 580 1947
## 6003 0 0 3 7 2400 600 1918
## yr_renovated zipcode lat long sqft_living15 sqft_lot15 price
## 2866 0 98103 47.6878 -122.331 1330 4700 640000
## 6003 1999 98106 47.5560 -122.363 1420 4960 520000
## logprice
## 2866 13.36922
## 6003 13.16158
In fact, there are only two houses with more than 10 bedrooms, so we exclude these records from our analysis to avoid disturbing our model.
graph4 <- boxplot(train2[, "logprice"] ~ train2[, "bathrooms"],
main = "Price vs Bathrooms", col=c("#F1C40F","#336633"),
xlab="bathrooms", ylab="log (price)")
The relationship looks linear overall. Nonetheless, the price of the house with 7.5 bathrooms is noticeably lower than its neighbours, so we decide to eliminate that record.
graph5 <- boxplot(train2[, "logprice"] ~ train2[, "floors"],
main = "Price vs Floors", col=c("#F1C40F","#336633"),
xlab="floors", ylab="log (price)")
The relationship is linear.
graph6 <- boxplot(train2[, "logprice"] ~ train2[, "waterfront"],
main = "Price vs Waterfront", col=c("#F1C40F","#336633"),
xlab="waterfront", ylab="log (price)")
The relationship is linear. Here, 0 represents the houses without a waterfront, while 1 represents the houses with one. It is common sense that a house with a waterfront commands a higher price.
graph7 <- boxplot(train2[, "logprice"] ~ train2[, "view"],
main = "Price vs View", col=c("#F1C40F","#336633"),
xlab="view", ylab="log (price)")
The relationship is linear. Here, 0 represents the worst view and 4 the best. It is common sense that the more beautiful the view, the higher the price of the house.
graph8 <- boxplot(train2[, "logprice"] ~ train2[, "condition"],
main = "Price vs Condition", col=c("#F1C40F","#336633"),
xlab="condition", ylab="log (price)")
The relationship is linear.
graph9 <- boxplot(train2[, "logprice"] ~ train2[, "grade"],
main = "Price vs Grade", col=c("#F1C40F","#336633"),
xlab="grade", ylab="log (price)")
The relationship is linear.
graph10 <- boxplot(train2[, "logprice"] ~ train2[, "sqft_living"],
main = "Price vs sqft_Living", col=c("#F1C40F","#336633"),
xlab="sqft_living", ylab="log (price)")
The relationship looks linear: the larger the living area, the higher the price of the house.
graph11 <- boxplot(train2[, "logprice"] ~ train2[, "sqft_basement"],
main = "Price vs sqft_Basement", col=c("#F1C40F","#336633"),
xlab="sqft_basement", ylab="log (price)")
This graph is interesting. We can see that a lot of houses do not have a basement at all, while for the houses that do, the price does not really change even as the size of the basement varies widely.
length(train2$sqft_basement[train2$sqft_basement == 0])
## [1] 10521
length(train2$sqft_basement[train2$sqft_basement != 0])
## [1] 6756
Therefore, I decided to simplify this field into a boolean: it will take value 0 if the house does not have a basement, and value 1 if it has one, regardless of its size.
graph12 <- boxplot(train2[, "logprice"] ~ train2[, "yr_renovated"],
main = "Price vs yr_Renovated", col=c("#F1C40F","#336633"),
xlab="yr_renovated", ylab="log (price)")
This graph is very similar to the previous one regarding the basement. The same reasoning applies to renovation: it does not really matter when the house was renovated, but whether it has been renovated at all.
length(train2$yr_renovated[train2$yr_renovated == 0])
## [1] 16538
length(train2$yr_renovated[train2$yr_renovated != 0])
## [1] 739
As we can see, most of the houses have not been renovated, so we will transform this field into a boolean as well.
graph13 <- boxplot(train2[, "logprice"] ~ train2[, "zipcode"],
main = "Price vs Zipcode", col=c("#F1C40F","#336633"),
xlab="zipcode", ylab="log (price)")
We cannot really see a pattern in this graph, so we will convert the variable to a factor when modelling (as anticipated earlier).
graph14 <- boxplot(train2[, "logprice"] ~ train2[, "sqft_living15"],
main = "Price vs sqft_Living(15)", col=c("#F1C40F","#336633"),
xlab="sqft_living(15)", ylab="log (price)")
As with the variable sqft_living, we see a direct relationship in this graph: as the living area increases, the price of the house increases as well.
Given all the insights from the data visualization section, we can proceed to apply the changes to our dataset.
First, as seen in the bathrooms graph, we eliminate the record with 7.5 bathrooms because its price was odd and probably a mistake.
train2 <- train2[-5419, ] # row of the house with 7.5 bathrooms
Then, as seen in the bedrooms graph, we only keep the houses with 10 or fewer bedrooms, since the two houses with 11 and 33 bedrooms had anomalous values.
train2 <- train2[train2$bedrooms <= 10, ]
We dummify the sqft_basement field, assigning 1 to all houses with a basement and keeping 0 for the houses without one.
train2$sqft_basement[train2$sqft_basement != 0] = 1
test2$sqft_basement[test2$sqft_basement != 0] = 1
We dummify the yr_renovated field, assigning 1 to all houses that have been renovated and keeping 0 for the houses that have not.
train2$yr_renovated[train2$yr_renovated != 0] = 1
test2$yr_renovated[test2$yr_renovated != 0] = 1
Lastly, we stack the train and test sets back together.
train2$logprice <- NULL
complete2 <- Stack(train2, test2)
I create a copy of the complete dataset to proceed with the analysis.
complete3 <- complete2
Now it is time to rescale the variables involving square feet. I create a function that normalizes the values of a field between 0 and 1, 0 being the lowest value and 1 the highest.
range01 <- function(x) {
(x-min(x)) / (max(x)-min(x))
}
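A quick sanity check of range01 on a toy vector (the function is repeated here so the snippet is self-contained):

```r
range01 <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}

# The minimum maps to 0, the maximum to 1, values in between scale linearly
range01(c(2, 4, 6))  # returns 0.0 0.5 1.0
```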
Therefore, I proceed to rescale the three variables: sqft_living is normalized with range01, while sqft_above and sqft_living15 are standardized with scale().
complete3$sqft_living <- range01(complete3$sqft_living)
complete3$sqft_above <- scale(complete3$sqft_above)
complete3$sqft_living15 <- scale(complete3$sqft_living15)
In order to proceed with the design of the model, I convert the following variables to factors, because they do not make sense as numeric in this context:
complete3$bedrooms <- as.factor(complete3$bedrooms)
complete3$bathrooms <- as.factor(complete3$bathrooms)
complete3$floors <- as.factor(complete3$floors)
complete3$waterfront <- as.factor(complete3$waterfront)
complete3$view <- as.factor(complete3$view)
complete3$grade <- as.factor(complete3$grade)
complete3$sqft_basement <- as.factor(complete3$sqft_basement)
complete3$yr_renovated <- as.factor(complete3$yr_renovated)
complete3$condition <- as.factor(complete3$condition)
complete3$zipcode <- as.factor(complete3$zipcode)
Another variable that does not make sense as numeric is lat. Nonetheless, we cannot treat it as a factor directly either, because it takes too many distinct values.
Therefore, we bin it into 13 intervals.
breaks_lat <- c(-Inf, 47.20, 47.25, 47.30, 47.35, 47.40, 47.45, 47.50, 47.55, 47.60,
47.65, 47.70, 47.75, Inf)
names_lat <- c("47.15 - 47.20", "47.20 - 47.25", "47.25 - 47.30", "47.30 - 47.35",
"47.35 - 47.40", "47.40 - 47.45", "47.45 - 47.50", "47.50 - 47.55",
"47.55 - 47.60", "47.60 - 47.65", "47.65 - 47.70", "47.70 - 47.75",
"47.75 - 47.80")
complete3$lat <- cut(complete3$lat, breaks = breaks_lat, labels = names_lat)
Finally, I split the complete dataset back into train and test to build our model.
train3 <- split(complete3, complete3$price > 0)
train3 <- train3[["TRUE"]]
test3 <- split(complete3, is.na(complete3$price))
test3 <- test3[["TRUE"]]
I decided to use a linear regression model.
First, I create a dataset with all the variables I am going to need in my model.
train4 <- train3[, c("bedrooms", "bathrooms", "sqft_living", "floors", "waterfront", "view",
"grade", "sqft_above", "sqft_living15", "lat", "condition",
"yr_renovated", "zipcode", "price")]
test4 <- test3
Then, I specify the settings for my model: the formula and a 5-fold cross-validation scheme.
formula = price ~ .
fitControl <- trainControl(method = "cv", number = 5)
And last, I run my model.
model1 <- train(formula,
data = train4,
method = "lm",
trControl = fitControl,
metric = "MAE")
The model scores an R2 of 0.84, which is a relatively good result.
summary(model1)
##
## Call:
## lm(formula = .outcome ~ ., data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2190582 -59141 1648 54520 3583476
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 113830.4 175708.9 0.648 0.517100
## bedrooms2 18314.4 12350.5 1.483 0.138124
## bedrooms3 26826.7 12385.8 2.166 0.030330 *
## bedrooms4 10704.3 12674.0 0.845 0.398351
## bedrooms5 3461.4 13376.7 0.259 0.795819
## bedrooms6 -41118.7 16297.5 -2.523 0.011645 *
## bedrooms7 -174762.9 30624.6 -5.707 1.17e-08 ***
## bedrooms8 -27843.7 48510.5 -0.574 0.565993
## bedrooms9 -215650.8 85973.6 -2.508 0.012139 *
## bedrooms10 -306878.0 85936.7 -3.571 0.000357 ***
## bathrooms0.75 2854.0 85907.9 0.033 0.973498
## bathrooms1 39490.9 83609.0 0.472 0.636699
## bathrooms1.25 -21275.3 98048.7 -0.217 0.828221
## bathrooms1.5 30505.9 83755.3 0.364 0.715693
## bathrooms1.75 29472.3 83716.6 0.352 0.724806
## bathrooms2 29266.7 83743.6 0.349 0.726734
## bathrooms2.25 47229.1 83782.8 0.564 0.572960
## bathrooms2.5 36145.9 83765.9 0.432 0.666102
## bathrooms2.75 33780.6 83870.1 0.403 0.687120
## bathrooms3 48649.3 83947.3 0.580 0.562244
## bathrooms3.25 101589.6 84065.0 1.208 0.226885
## bathrooms3.5 64331.5 84029.3 0.766 0.443935
## bathrooms3.75 115765.8 84934.2 1.363 0.172898
## bathrooms4 121190.7 85050.8 1.425 0.154198
## bathrooms4.25 218580.3 86014.1 2.541 0.011056 *
## bathrooms4.5 175160.7 85573.8 2.047 0.040684 *
## bathrooms4.75 458082.7 90560.2 5.058 4.27e-07 ***
## bathrooms5 145126.9 92667.0 1.566 0.117341
## bathrooms5.25 358749.2 95235.7 3.767 0.000166 ***
## bathrooms5.5 809693.9 104724.2 7.732 1.12e-14 ***
## bathrooms5.75 -72486.5 119719.1 -0.605 0.544874
## bathrooms6 452155.2 112699.2 4.012 6.04e-05 ***
## bathrooms6.25 615522.3 135355.4 4.547 5.47e-06 ***
## bathrooms6.5 134140.4 133142.7 1.007 0.313712
## bathrooms6.75 1341142.1 169579.1 7.909 2.76e-15 ***
## bathrooms7.75 2737817.5 177664.8 15.410 < 2e-16 ***
## bathrooms8 1696080.5 138313.6 12.263 < 2e-16 ***
## sqft_living 1412948.9 48020.2 29.424 < 2e-16 ***
## floors1.5 -6775.6 4422.9 -1.532 0.125557
## floors2 -29217.1 3833.9 -7.621 2.65e-14 ***
## floors2.5 82601.3 13867.5 5.956 2.63e-09 ***
## floors3 -89419.3 7922.3 -11.287 < 2e-16 ***
## floors3.5 126311.6 73590.5 1.716 0.086105 .
## waterfront1 603760.2 16366.7 36.890 < 2e-16 ***
## view1 87146.1 9293.6 9.377 < 2e-16 ***
## view2 68955.2 5658.1 12.187 < 2e-16 ***
## view3 153303.9 7663.6 20.004 < 2e-16 ***
## view4 272209.2 11889.8 22.894 < 2e-16 ***
## grade4 -164197.8 149366.7 -1.099 0.271656
## grade5 -202224.7 146923.0 -1.376 0.168716
## grade6 -211707.9 146778.5 -1.442 0.149218
## grade7 -202115.4 146818.0 -1.377 0.168642
## grade8 -172827.4 146859.7 -1.177 0.239284
## grade9 -94414.4 146928.3 -0.643 0.520500
## grade10 22656.2 147038.1 0.154 0.877545
## grade11 197241.4 147290.3 1.339 0.180545
## grade12 467112.1 148327.2 3.149 0.001640 **
## grade13 1155883.0 155807.2 7.419 1.24e-13 ***
## sqft_above 40930.0 3181.3 12.866 < 2e-16 ***
## sqft_living15 13026.7 2031.9 6.411 1.48e-10 ***
## `lat47.20 - 47.25` 7622.9 22796.7 0.334 0.738092
## `lat47.25 - 47.30` 56801.8 36727.6 1.547 0.121986
## `lat47.30 - 47.35` 47898.8 37606.8 1.274 0.202797
## `lat47.35 - 47.40` 53471.7 38931.0 1.374 0.169615
## `lat47.40 - 47.45` 97021.6 40068.0 2.421 0.015470 *
## `lat47.45 - 47.50` 93070.2 40942.4 2.273 0.023027 *
## `lat47.50 - 47.55` 120873.1 41577.6 2.907 0.003652 **
## `lat47.55 - 47.60` 169214.8 42027.7 4.026 5.69e-05 ***
## `lat47.60 - 47.65` 247363.5 43406.5 5.699 1.23e-08 ***
## `lat47.65 - 47.70` 230068.5 44366.6 5.186 2.18e-07 ***
## `lat47.70 - 47.75` 215981.8 45133.4 4.785 1.72e-06 ***
## `lat47.75 - 47.80` 189948.4 45685.8 4.158 3.23e-05 ***
## condition2 75954.1 33519.9 2.266 0.023468 *
## condition3 84927.3 31265.9 2.716 0.006608 **
## condition4 110606.6 31289.9 3.535 0.000409 ***
## condition5 162552.9 31473.7 5.165 2.44e-07 ***
## yr_renovated1 64266.3 5621.8 11.432 < 2e-16 ***
## zipcode98002 387.4 14042.9 0.028 0.977990
## zipcode98003 -4009.5 12776.3 -0.314 0.753658
## zipcode98004 581026.8 23743.7 24.471 < 2e-16 ***
## zipcode98005 153577.5 24632.8 6.235 4.63e-10 ***
## zipcode98006 143975.4 21462.7 6.708 2.03e-11 ***
## zipcode98007 84513.3 25535.3 3.310 0.000936 ***
## zipcode98008 83360.1 24384.1 3.419 0.000631 ***
## zipcode98010 54937.0 18201.7 3.018 0.002546 **
## zipcode98011 -13224.5 28974.8 -0.456 0.648100
## zipcode98014 -75293.9 28097.7 -2.680 0.007376 **
## zipcode98019 -61627.4 28894.3 -2.133 0.032951 *
## zipcode98022 34072.6 33717.7 1.011 0.312258
## zipcode98023 -33498.2 11182.9 -2.995 0.002744 **
## zipcode98024 56962.3 26493.6 2.150 0.031566 *
## zipcode98027 94804.6 20680.7 4.584 4.59e-06 ***
## zipcode98028 -17205.6 28245.4 -0.609 0.542435
## zipcode98029 108626.4 22223.2 4.888 1.03e-06 ***
## zipcode98030 5551.5 16215.0 0.342 0.732078
## zipcode98031 -14411.7 17173.9 -0.839 0.401391
## zipcode98032 -1255.8 19548.1 -0.064 0.948780
## zipcode98033 179587.3 25983.4 6.912 4.96e-12 ***
## zipcode98034 36171.6 27121.4 1.334 0.182322
## zipcode98038 31301.1 14011.4 2.234 0.025498 *
## zipcode98039 928698.7 34264.2 27.104 < 2e-16 ***
## zipcode98040 381226.3 22303.7 17.093 < 2e-16 ***
## zipcode98042 -2831.4 13661.1 -0.207 0.835810
## zipcode98045 54456.0 20428.5 2.666 0.007690 **
## zipcode98052 61047.3 25377.4 2.406 0.016157 *
## zipcode98053 26245.3 25754.5 1.019 0.308191
## zipcode98055 -2750.3 19430.3 -0.142 0.887440
## zipcode98056 27545.6 20465.8 1.346 0.178342
## zipcode98058 -11956.6 18124.7 -0.660 0.509465
## zipcode98059 31291.6 19913.6 1.571 0.116117
## zipcode98065 22391.8 21723.1 1.031 0.302656
## zipcode98070 -3765.3 20653.9 -0.182 0.855345
## zipcode98072 9780.9 28235.2 0.346 0.729041
## zipcode98074 -19471.8 24381.0 -0.799 0.424507
## zipcode98075 38717.0 22365.8 1.731 0.083455 .
## zipcode98077 -26557.8 28837.5 -0.921 0.357091
## zipcode98092 -21326.8 11968.5 -1.782 0.074781 .
## zipcode98102 301006.6 28451.8 10.580 < 2e-16 ***
## zipcode98103 169097.9 25816.1 6.550 5.91e-11 ***
## zipcode98105 306119.8 27120.5 11.287 < 2e-16 ***
## zipcode98106 50726.4 21616.5 2.347 0.018954 *
## zipcode98107 182195.5 26988.5 6.751 1.52e-11 ***
## zipcode98108 27535.9 23297.5 1.182 0.237252
## zipcode98109 344538.9 28018.8 12.297 < 2e-16 ***
## zipcode98112 425357.6 25216.5 16.868 < 2e-16 ***
## zipcode98115 164599.3 25870.1 6.363 2.03e-10 ***
## zipcode98116 178734.3 22526.9 7.934 2.25e-15 ***
## zipcode98117 152520.8 25918.5 5.885 4.06e-09 ***
## zipcode98118 73299.3 21040.6 3.484 0.000496 ***
## zipcode98119 296744.3 26359.0 11.258 < 2e-16 ***
## zipcode98122 145489.5 25202.7 5.773 7.93e-09 ***
## zipcode98125 38354.3 27440.9 1.398 0.162219
## zipcode98126 109243.6 21510.9 5.079 3.84e-07 ***
## zipcode98133 6557.5 27385.3 0.239 0.810756
## zipcode98136 178382.9 22213.9 8.030 1.04e-15 ***
## zipcode98144 162303.9 22428.3 7.237 4.80e-13 ***
## zipcode98146 29240.0 21014.5 1.391 0.164116
## zipcode98148 17989.5 25969.6 0.693 0.488499
## zipcode98155 -9493.0 27686.1 -0.343 0.731693
## zipcode98166 27681.2 19918.9 1.390 0.164640
## zipcode98168 2693.5 20735.3 0.130 0.896649
## zipcode98177 78350.1 28406.9 2.758 0.005819 **
## zipcode98178 -23646.2 20999.4 -1.126 0.260163
## zipcode98188 -14923.4 21528.2 -0.693 0.488192
## zipcode98198 -20883.2 16449.2 -1.270 0.204261
## zipcode98199 209708.8 25175.0 8.330 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 144300 on 17128 degrees of freedom
## Multiple R-squared: 0.8463, Adjusted R-squared: 0.845
## F-statistic: 650.6 on 145 and 17128 DF, p-value: < 2.2e-16
Then, we extract the model's fitted values for every house in the train dataset, to see how well they reflect the actual prices.
train4_fit <- fitted(model1)
train4$pred <- train4_fit
We score our model on the MAPE, the mean absolute percentage error. The lower this number, the better the model.
mape <- function(real, predicted) {
return(mean(abs((real - predicted) / real)))
}
mape(train4$price, train4$pred)
## [1] 0.1668104
We get a MAPE score of 0.167, which is a good result for our model.
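To make the metric concrete, here is a toy example (the numbers are illustrative, not from the dataset):

```r
mape <- function(real, predicted) {
  return(mean(abs((real - predicted) / real)))
}

# Both predictions are off by 10% of the true value, so the MAPE is 0.1
mape(c(100, 200), c(110, 180))  # returns 0.1
```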
Now I try a random forest to see if my score improves. I fit a model with 500 trees (the randomForest default). Note that zipcode is excluded from this feature set, since the randomForest package cannot handle categorical predictors with more than 53 levels.
train5 <- train3[, c("bedrooms", "bathrooms", "sqft_living", "floors", "waterfront", "view",
"grade", "sqft_above", "sqft_living15", "lat", "condition",
"yr_renovated", "price")]
test5 <- test3
model2 <- randomForest(formula,
data = train5)
model2
##
## Call:
## randomForest(formula = formula, data = train5)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 4
##
## Mean of squared residuals: 26398897860
## % Var explained: 80.35
The model scores an R2 of 0.8035, which is not better than the linear model I ran before.
Last, we use our model to predict the prices in the test dataset.
test4$price <- predict(model1, test4)
Now I try a linear regression with the log of prices.
First, I create a dataset with all the variables I am going to need in my model.
train6 <- train3[, c("bedrooms", "bathrooms", "sqft_living", "floors", "waterfront", "view",
"grade", "sqft_above", "sqft_living15", "lat", "condition",
"yr_renovated", "zipcode", "price")]
test6 <- test3
Then, I log-transform the price variable:
train6$price <- log(train6$price)
test6$price <- log(test6$price)
And last, I run my model.
model3 <- train(formula,
data = train6,
method = "lm",
trControl = fitControl,
metric = "MAE")
This model turns out to be the best, with an R2 of 0.88.
summary(model3)
##
## Call:
## lm(formula = .outcome ~ ., data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.26579 -0.09984 0.00356 0.10231 1.12159
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.808937 0.220018 53.673 < 2e-16 ***
## bedrooms2 0.041196 0.015465 2.664 0.007733 **
## bedrooms3 0.052312 0.015509 3.373 0.000745 ***
## bedrooms4 0.049575 0.015870 3.124 0.001788 **
## bedrooms5 0.028251 0.016750 1.687 0.091695 .
## bedrooms6 -0.003276 0.020407 -0.161 0.872457
## bedrooms7 -0.128867 0.038347 -3.361 0.000780 ***
## bedrooms8 -0.081525 0.060744 -1.342 0.179575
## bedrooms9 -0.051054 0.107654 -0.474 0.635334
## bedrooms10 -0.168775 0.107608 -1.568 0.116800
## bathrooms0.75 0.252896 0.107572 2.351 0.018737 *
## bathrooms1 0.285058 0.104693 2.723 0.006480 **
## bathrooms1.25 0.307928 0.122774 2.508 0.012148 *
## bathrooms1.5 0.300974 0.104876 2.870 0.004112 **
## bathrooms1.75 0.335163 0.104828 3.197 0.001390 **
## bathrooms2 0.330657 0.104862 3.153 0.001617 **
## bathrooms2.25 0.358195 0.104911 3.414 0.000641 ***
## bathrooms2.5 0.361940 0.104889 3.451 0.000561 ***
## bathrooms2.75 0.366007 0.105020 3.485 0.000493 ***
## bathrooms3 0.360288 0.105117 3.428 0.000611 ***
## bathrooms3.25 0.380001 0.105264 3.610 0.000307 ***
## bathrooms3.5 0.379947 0.105219 3.611 0.000306 ***
## bathrooms3.75 0.388190 0.106352 3.650 0.000263 ***
## bathrooms4 0.358334 0.106498 3.365 0.000768 ***
## bathrooms4.25 0.384542 0.107705 3.570 0.000357 ***
## bathrooms4.5 0.350517 0.107153 3.271 0.001073 **
## bathrooms4.75 0.357969 0.113397 3.157 0.001598 **
## bathrooms5 0.340433 0.116035 2.934 0.003352 **
## bathrooms5.25 0.334061 0.119252 2.801 0.005095 **
## bathrooms5.5 0.326108 0.131133 2.487 0.012897 *
## bathrooms5.75 0.147547 0.149909 0.984 0.325009
## bathrooms6 0.236772 0.141119 1.678 0.093400 .
## bathrooms6.25 0.133357 0.169489 0.787 0.431397
## bathrooms6.5 0.372262 0.166718 2.233 0.025569 *
## bathrooms6.75 -0.266693 0.212343 -1.256 0.209149
## bathrooms7.75 -0.225408 0.222467 -1.013 0.310970
## bathrooms8 -0.524727 0.173193 -3.030 0.002451 **
## sqft_living 1.875299 0.060130 31.188 < 2e-16 ***
## floors1.5 0.017854 0.005538 3.224 0.001267 **
## floors2 -0.049856 0.004801 -10.385 < 2e-16 ***
## floors2.5 -0.018280 0.017365 -1.053 0.292495
## floors3 -0.156986 0.009920 -15.825 < 2e-16 ***
## floors3.5 -0.175153 0.092148 -1.901 0.057348 .
## waterfront1 0.456475 0.020494 22.274 < 2e-16 ***
## view1 0.114449 0.011637 9.835 < 2e-16 ***
## view2 0.104969 0.007085 14.816 < 2e-16 ***
## view3 0.180525 0.009596 18.812 < 2e-16 ***
## view4 0.283162 0.014888 19.019 < 2e-16 ***
## grade4 -0.642328 0.187033 -3.434 0.000595 ***
## grade5 -0.504105 0.183973 -2.740 0.006148 **
## grade6 -0.406596 0.183792 -2.212 0.026962 *
## grade7 -0.307781 0.183842 -1.674 0.094117 .
## grade8 -0.218095 0.183894 -1.186 0.235645
## grade9 -0.117054 0.183980 -0.636 0.524633
## grade10 -0.064642 0.184117 -0.351 0.725523
## grade11 -0.030581 0.184433 -0.166 0.868309
## grade12 0.023109 0.185732 0.124 0.900983
## grade13 0.176309 0.195098 0.904 0.366170
## sqft_above 0.071466 0.003984 17.940 < 2e-16 ***
## sqft_living15 0.052324 0.002544 20.565 < 2e-16 ***
## `lat47.20 - 47.25` 0.039729 0.028545 1.392 0.164010
## `lat47.25 - 47.30` 0.165263 0.045989 3.594 0.000327 ***
## `lat47.30 - 47.35` 0.164443 0.047090 3.492 0.000480 ***
## `lat47.35 - 47.40` 0.180708 0.048748 3.707 0.000210 ***
## `lat47.40 - 47.45` 0.267589 0.050172 5.333 9.76e-08 ***
## `lat47.45 - 47.50` 0.245121 0.051267 4.781 1.76e-06 ***
## `lat47.50 - 47.55` 0.350469 0.052062 6.732 1.73e-11 ***
## `lat47.55 - 47.60` 0.458231 0.052626 8.707 < 2e-16 ***
## `lat47.60 - 47.65` 0.509781 0.054353 9.379 < 2e-16 ***
## `lat47.65 - 47.70` 0.505051 0.055555 9.091 < 2e-16 ***
## `lat47.70 - 47.75` 0.474429 0.056515 8.395 < 2e-16 ***
## `lat47.75 - 47.80` 0.420774 0.057207 7.355 1.99e-13 ***
## condition2 0.085037 0.041973 2.026 0.042780 *
## condition3 0.218205 0.039150 5.573 2.53e-08 ***
## condition4 0.268080 0.039180 6.842 8.06e-12 ***
## condition5 0.329011 0.039411 8.348 < 2e-16 ***
## yr_renovated1 0.082993 0.007039 11.790 < 2e-16 ***
## zipcode98002 -0.039612 0.017584 -2.253 0.024291 *
## zipcode98003 0.002160 0.015998 0.135 0.892619
## zipcode98004 0.761151 0.029731 25.601 < 2e-16 ***
## zipcode98005 0.394195 0.030845 12.780 < 2e-16 ***
## zipcode98006 0.348343 0.026875 12.962 < 2e-16 ***
## zipcode98007 0.307884 0.031975 9.629 < 2e-16 ***
## zipcode98008 0.300624 0.030533 9.846 < 2e-16 ***
## zipcode98010 0.240369 0.022792 10.546 < 2e-16 ***
## zipcode98011 0.154543 0.036282 4.260 2.06e-05 ***
## zipcode98014 0.005783 0.035183 0.164 0.869450
## zipcode98019 0.026112 0.036181 0.722 0.470489
## zipcode98022 0.183082 0.042220 4.336 1.46e-05 ***
## zipcode98023 -0.048464 0.014003 -3.461 0.000539 ***
## zipcode98024 0.205739 0.033175 6.202 5.71e-10 ***
## zipcode98027 0.324987 0.025896 12.550 < 2e-16 ***
## zipcode98028 0.125683 0.035368 3.554 0.000381 ***
## zipcode98029 0.305300 0.027827 10.971 < 2e-16 ***
## zipcode98030 0.024725 0.020304 1.218 0.223342
## zipcode98031 -0.015075 0.021505 -0.701 0.483313
## zipcode98032 -0.053609 0.024478 -2.190 0.028530 *
## zipcode98033 0.424382 0.032536 13.044 < 2e-16 ***
## zipcode98034 0.213127 0.033961 6.276 3.56e-10 ***
## zipcode98038 0.130213 0.017545 7.422 1.21e-13 ***
## zipcode98039 0.866585 0.042905 20.198 < 2e-16 ***
## zipcode98040 0.584275 0.027928 20.921 < 2e-16 ***
## zipcode98042 0.028910 0.017106 1.690 0.091036 .
## zipcode98045 0.250503 0.025580 9.793 < 2e-16 ***
## zipcode98052 0.281106 0.031777 8.846 < 2e-16 ***
## zipcode98053 0.232441 0.032249 7.208 5.93e-13 ***
## zipcode98055 0.031107 0.024330 1.279 0.201071
## zipcode98056 0.151339 0.025627 5.906 3.58e-09 ***
## zipcode98058 0.055738 0.022695 2.456 0.014061 *
## zipcode98059 0.207716 0.024935 8.330 < 2e-16 ***
## zipcode98065 0.181036 0.027201 6.655 2.91e-11 ***
## zipcode98070 0.314529 0.025862 12.162 < 2e-16 ***
## zipcode98072 0.197403 0.035355 5.583 2.39e-08 ***
## zipcode98074 0.184361 0.030529 6.039 1.58e-09 ***
## zipcode98075 0.227151 0.028006 8.111 5.36e-16 ***
## zipcode98077 0.170110 0.036110 4.711 2.49e-06 ***
## zipcode98092 0.018911 0.014987 1.262 0.207011
## zipcode98102 0.605860 0.035627 17.006 < 2e-16 ***
## zipcode98103 0.473939 0.032326 14.661 < 2e-16 ***
## zipcode98105 0.590728 0.033960 17.395 < 2e-16 ***
## zipcode98106 0.135825 0.027068 5.018 5.27e-07 ***
## zipcode98107 0.500034 0.033794 14.796 < 2e-16 ***
## zipcode98108 0.107507 0.029173 3.685 0.000229 ***
## zipcode98109 0.652427 0.035084 18.596 < 2e-16 ***
## zipcode98112 0.680156 0.031575 21.541 < 2e-16 ***
## zipcode98115 0.463131 0.032394 14.297 < 2e-16 ***
## zipcode98116 0.447828 0.028208 15.876 < 2e-16 ***
## zipcode98117 0.462297 0.032454 14.244 < 2e-16 ***
## zipcode98118 0.220326 0.026346 8.363 < 2e-16 ***
## zipcode98119 0.624008 0.033006 18.906 < 2e-16 ***
## zipcode98122 0.440330 0.031558 13.953 < 2e-16 ***
## zipcode98125 0.248613 0.034361 7.235 4.84e-13 ***
## zipcode98126 0.311791 0.026935 11.575 < 2e-16 ***
## zipcode98133 0.154186 0.034291 4.496 6.96e-06 ***
## zipcode98136 0.470363 0.027816 16.910 < 2e-16 ***
## zipcode98144 0.343314 0.028084 12.224 < 2e-16 ***
## zipcode98146 0.117542 0.026314 4.467 7.99e-06 ***
## zipcode98148 0.052616 0.032519 1.618 0.105672
## zipcode98155 0.130782 0.034668 3.772 0.000162 ***
## zipcode98166 0.219213 0.024942 8.789 < 2e-16 ***
## zipcode98168 -0.030499 0.025964 -1.175 0.240150
## zipcode98177 0.301914 0.035570 8.488 < 2e-16 ***
## zipcode98178 -0.006055 0.026295 -0.230 0.817897
## zipcode98188 0.006968 0.026957 0.259 0.796023
## zipcode98198 0.014152 0.020597 0.687 0.492052
## zipcode98199 0.502300 0.031523 15.934 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1807 on 17128 degrees of freedom
## Multiple R-squared: 0.8839, Adjusted R-squared: 0.8829
## F-statistic: 899.4 on 145 and 17128 DF, p-value: < 2.2e-16
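Because the model is fit on the logarithm of price, a coefficient b can be read as an approximate (exp(b) - 1) * 100% change in price, holding the other variables fixed. A small sketch of this interpretation, using two coefficient values taken from the summary above:

```r
# Approximate percentage effect on price implied by a log-scale coefficient
pct_effect <- function(b) (exp(b) - 1) * 100

pct_effect(0.761151)  # zipcode98004: roughly +114% versus the baseline zipcode
pct_effect(0.329011)  # condition5: roughly +39% versus the baseline condition
```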
Then, we take the fitted values of the model as predictions for the price of every house in the train dataset, to check how closely they reflect the actual values.
train6_fit <- fitted(model3)
train6$pred <- train6_fit
Since the model was fit on the logarithm of price, we now transform both the actual and the predicted prices back to the original scale with the exponential function.
train6$price <- exp(train6$price)
train6$pred <- exp(train6$pred)
We score our model on the MAPE, the mean absolute percentage error. The lower this number, the better the model.
mape <- function(real, predicted) {
return(mean(abs((real - predicted) / real)))
}
mape(train6$price, train6$pred)
## [1] 0.1348133
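To make the metric concrete, here is a small worked example of the same MAPE function on toy prices (the values are illustrative, not from the dataset):

```r
# MAPE: mean of the absolute percentage errors
mape <- function(real, predicted) {
  mean(abs((real - predicted) / real))
}

# Errors of 10%, 5%, and 0% average out to 5%
mape(c(100, 200, 400), c(110, 190, 400))
```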
We get a MAPE score of about 0.135, meaning the predictions are off by roughly 13.5% of the actual price on average.
This third model is the best so far, with an R-squared of 0.88 and a MAPE of 0.13, so we decide to deploy it.
Now we predict the prices on the test dataset. As with the train predictions, the model outputs log prices, so we apply the exponential function to bring them back to the original scale.
test6$price <- exp(predict(model3, test6))
Then I build the submission dataframe with just the id and the predicted price.
test6 <- test6[, c("price")]
test <- test[, c("id")]
submission <- data.frame(test, test6)
names(submission) <- c("id", "price")
Last, I export the submission dataframe as a csv file, dropping the row names.
write.csv(submission, "Alberto_Lombatti_submission.csv", row.names = FALSE)